Pandas Tutorial

Software Carpentry, EITN, Paris, November 20th, 2015

Bartosz Teleńczuk

forked from the tutorial at EuroScipy 2015 by Joris Van den Bossche (Ghent University, Belgium)

Licensed under CC BY 4.0 Creative Commons

Content of this talk

  • Why do you need pandas?
  • Basic introduction to the data structures
  • Guided tour through some of the pandas features with two case studies: a movie database and air quality data

If you want to follow along, this is a notebook that you can view or run yourself:

Some imports:


In [7]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_rows = 8

Let's start with a showcase

Case study: air quality in Europe

AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe

Starting from these hourly data for different stations:


In [2]:
data = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True, na_values='-9999')

In [3]:
data


Out[3]:
BETR801 BETN029 FR04037 FR04012
1998-01-01 00:00:00 NaN 16.0 NaN NaN
1998-01-01 01:00:00 NaN 13.0 NaN NaN
1998-01-01 02:00:00 NaN 12.0 NaN NaN
1998-01-01 03:00:00 NaN 12.0 NaN NaN
... ... ... ... ...
2012-12-31 20:00:00 16.5 2.0 16 47
2012-12-31 21:00:00 14.5 2.5 13 43
2012-12-31 22:00:00 16.5 3.5 14 42
2012-12-31 23:00:00 15.0 3.0 13 49

131265 rows × 4 columns

to answering questions about this data in a few lines of code:

Does the air pollution show a decreasing trend over the years?


In [4]:
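# Annual means ('A' = year-end frequency); recent pandas versions need an explicit aggregation such as .mean()
data.loc['1999':].resample('A').mean().plot(ylim=[0, 100])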


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f458c4c4f28>

How many exceedances of the limit values?


In [5]:
exceedances = data > 200
exceedances = exceedances.groupby(exceedances.index.year).sum()
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')


Out[5]:
<matplotlib.lines.Line2D at 0x7f458c1d15c0>
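
The counting step above relies on the fact that summing a boolean frame counts its True values. A minimal sketch on invented data (the station name `station_x` and all values are assumptions for illustration, not AirBase measurements):

```python
import pandas as pd

# Four invented hourly values that straddle a year boundary.
index = pd.to_datetime([
    "2005-12-31 22:00", "2005-12-31 23:00",
    "2006-01-01 00:00", "2006-01-01 01:00",
])
toy = pd.DataFrame({"station_x": [210.0, 150.0, 250.0, 220.0]}, index=index)

# Comparing a frame with a number gives a frame of booleans.
above = toy > 200

# Group the rows by calendar year; summing booleans counts the True values.
per_year = above.groupby(above.index.year).sum()
print(per_year)
```

Here `per_year` has one row per year and one column per station, which is the shape that the bar plot above consumes.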

What is the difference in diurnal profile between weekdays and weekends?


In [6]:
data['weekday'] = data.index.weekday
data['weekend'] = data['weekday'].isin([5, 6])
data_weekend = data.groupby(['weekend', data.index.hour])['FR04012'].mean().unstack(level=0)
data_weekend.plot()


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f458c0ca0f0>
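
To make the grouping and `unstack` steps above easier to follow, here is a small sketch on invented data (the column `station_x` and its values are assumptions for illustration). The final `rename` is an optional extra that is not in the cell above; it only gives the columns readable names.

```python
import pandas as pd

# Two invented measurements per day over a Friday, Saturday and Sunday.
index = pd.to_datetime([
    "2012-12-28 08:00", "2012-12-28 20:00",  # Friday
    "2012-12-29 08:00", "2012-12-29 20:00",  # Saturday
    "2012-12-30 08:00", "2012-12-30 20:00",  # Sunday
])
toy = pd.DataFrame({"station_x": [40.0, 60.0, 10.0, 30.0, 20.0, 50.0]}, index=index)

# weekday is 0 for Monday up to 6 for Sunday, so 5 and 6 mark the weekend.
toy["weekday"] = toy.index.weekday
toy["weekend"] = toy["weekday"].isin([5, 6])

# Group by (weekend flag, hour of day), average, then move the weekend
# flag from the row index into the columns.
profile = toy.groupby(["weekend", toy.index.hour])["station_x"].mean().unstack(level=0)

# Optional: readable column names instead of False/True.
profile = profile.rename(columns={False: "weekday", True: "weekend"})
print(profile)
```

The result has one row per hour of the day and one column per group, so calling `.plot()` on it draws one line for weekdays and one for weekends.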

We will come back to these examples and build them up step by step.

Why do you need pandas?

When working with tabular or structured data (such as an R data frame, a SQL table, or an Excel spreadsheet):

  • Import data
  • Clean up messy data
  • Explore data, gain insight into data
  • Process and prepare your data for analysis
  • Analyse your data (together with scikit-learn, statsmodels, ...)

Pandas: data analysis in Python

For data-intensive work in Python the Pandas library has become essential.

What is pandas?

  • Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
  • Pandas can also be thought of as R's data.frame in Python.
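
As a rough illustration of the "NumPy array with labels" idea, the sketch below wraps a small array in a DataFrame. The station names, timestamps and values are invented for the example.

```python
import numpy as np
import pandas as pd

# A plain NumPy array: elements are addressed by position and share one dtype.
values = np.array([[16.0, 47.0], [13.0, 43.0]])

# The same numbers as a DataFrame, now with row and column labels.
df = pd.DataFrame(
    values,
    index=pd.to_datetime(["2012-12-31 20:00", "2012-12-31 21:00"]),
    columns=["station_a", "station_b"],
)

# Columns can hold different types, much like R's data.frame.
df["valid"] = [True, False]

# Select by label rather than by position.
by_position = values[0, 1]
by_label = df.loc[pd.Timestamp("2012-12-31 20:00"), "station_b"]
print(df.dtypes)
```

Both `by_position` and `by_label` refer to the same number, but the label-based lookup keeps working if rows or columns are reordered.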

Its documentation: http://pandas.pydata.org/pandas-docs/stable/

Key features

  • Fast, easy and flexible input/output for a lot of different data formats
  • Working with missing data (.dropna(), pd.isnull())
  • Merging and joining (concat, join)
  • Grouping: groupby functionality
  • Reshaping (stack, pivot)
  • Powerful time series manipulation (resampling, timezones, ...)
  • Easy plotting
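
A few of the features listed above can be tried on a toy table. This is only a sketch: the station labels, years and values are invented, and it touches just the missing-data, merging, grouping and reshaping items.

```python
import numpy as np
import pandas as pd

# Invented measurements for two hypothetical stations.
df = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "year": [2011, 2012, 2011, 2012],
    "value": [10.0, np.nan, 30.0, 50.0],
})

# Missing data: count the NaN values, then drop the rows that contain them.
n_missing = pd.isnull(df["value"]).sum()
clean = df.dropna()

# Merging: append the rows of a second frame.
extra = pd.DataFrame({"station": ["C"], "year": [2012], "value": [20.0]})
combined = pd.concat([clean, extra], ignore_index=True)

# Grouping: mean value per station.
means = combined.groupby("station")["value"].mean()

# Reshaping: one row per year, one column per station.
wide = combined.pivot(index="year", columns="station", values="value")
print(wide)
```

Combinations that have no measurement, such as station C in 2011, show up as NaN in `wide`.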

Further reading

How can you help?

We need you!

Contributions are very welcome and can be in different domains:

  • reporting issues
  • improving the documentation
  • testing release candidates and providing feedback
  • triaging and fixing bugs
  • implementing new features
  • spreading the word

-> https://github.com/pydata/pandas